Annotated Transformer

from Transformer

http://nlp.seas.harvard.edu/2018/04/03/attention.html

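A walkthrough of the Transformer paper

It is in notebook form, so apparently you can read it while running the PyTorch code. Just reading the page above is enough to follow it well, though

Should be readable if you have built a simple model in PyTorch before

It also helps to understand attention beforehand

In a few places, parts of the model structure are passed in at forward time, which feels awkward

Still readable, though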

PreNorm and PostNorm

Note that for the encoder layers, the original paper uses PostNorm, whereas this walkthrough uses PreNorm

https://gyazo.com/b1da12cc5407a20e587bda23a3dc3e57

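I came across a tweet like this, but as far as I could tell the tensor2tensor implementation was PostNorm

Or so I thought, but that is wrong. The initial commit seems to have been PostNorm, and it was changed to PreNorm later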

That is, the output of each sub-layer is https://gyazo.com/2dbb2192c8110740564f7244672925b5

is the function implemented by the sub-layer itself. We apply dropout (cite) to the output of each sub-layer, before it is added to the sub-layer input and normalized.

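That is what the text says, and the same statement appears in the paper, but the implementation right below it is clearly PreNorm, which makes it a misleading explanation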

code:py
 class SublayerConnection(nn.Module):
     """
     A residual connection followed by a layer norm.
     Note for code simplicity the norm is first as opposed to last.
     """
     def __init__(self, size, dropout):
         super(SublayerConnection, self).__init__()
         self.norm = LayerNorm(size)
         self.dropout = nn.Dropout(dropout)
     def forward(self, x, sublayer):
         "Apply residual connection to any sublayer with the same size."
         return x + self.dropout(sublayer(self.norm(x)))

The comments section points out the same thing
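
To make the difference concrete, here is a minimal sketch of the two orderings in plain Python (no PyTorch, so it runs anywhere). It is an illustration under assumptions, not the notebook's code: `layer_norm` has no learned gain or bias, dropout is left out, and `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward block.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and (roughly) unit variance.
    Simplified: no learned gain/bias, unlike a real LayerNorm module."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    """PostNorm, as described in the paper: LayerNorm(x + Sublayer(x))."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def pre_norm(x, sublayer):
    """PreNorm, as in SublayerConnection.forward: x + Sublayer(LayerNorm(x))."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def toy_sublayer(x):
    # Hypothetical stand-in for attention / feed-forward: scale by 0.5.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 4.0, 9.0]
post = post_norm(x, toy_sublayer)  # normalized after the residual add
pre = pre_norm(x, toy_sublayer)    # residual stream itself is never normalized
```

With PostNorm the block output is always normalized, while with PreNorm the raw input `x` passes straight through the residual path and only the sublayer's input is normalized. That is why PreNorm stacks usually apply one more normalization after the last layer.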